Uniformize model processors (models w/o special arg names) #32845

Open

leloykun wants to merge 25 commits into main from fc--uniformize-kwargs-the-rest
Conversation

leloykun (Contributor) commented Aug 16, 2024:

What does this PR do?

  • Uniformizes kwargs for the processors of AltCLIP, Flava, Git, InstructBlipVideo, LLaVa-NeXT-Video, MGP, Siglip, TVP, VideoLLaVa, VILT, and X-CLIP, as discussed in Uniform kwargs for processors #31911; see the sketch below for the shared pattern
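
For context, the shared pattern from #31911 looks roughly like the sketch below, using Siglip as the example. The default values and the method body are illustrative, not the exact code in this PR.

from transformers.processing_utils import ProcessingKwargs, ProcessorMixin


class SiglipProcessorKwargs(ProcessingKwargs, total=False):
    # Per-modality defaults merged into every call; the value here is illustrative
    _defaults = {
        "text_kwargs": {"padding": "max_length"},
    }


class SiglipProcessor(ProcessorMixin):
    # attributes and __init__ omitted for brevity

    def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs):
        # Merge call-site kwargs with the class defaults and the tokenizer's
        # init kwargs, then dispatch each group to the matching sub-processor
        output_kwargs = self._merge_kwargs(
            SiglipProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
        ...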

TODO:

  • add tests
    • add tests for AltCLIP
    • add tests for Flava
    • add tests for Git
    • add tests for InstructBlipVideo
    • add tests for Llava-NeXT-Video
    • add tests for MGP
    • add tests for Siglip
    • add tests for TVP
    • add tests for VideoLLava
    • add tests for VILT
    • add tests for X-CLIP

Fixes # (issue)

Who can review?

@zucchini-nlp @molbap @NielsRogge

leloykun mentioned this pull request Aug 16, 2024
Comment on lines 123 to 137
if output_kwargs["text_kwargs"].get("visual_prompt") is not None and audio is not None:
    raise ValueError(
        "You cannot provide `visual_prompt` as a positional argument and as a keyword argument at the same time. "
        "Please provide it only as a keyword argument (i.e. `visual_prompt=...`)."
    )
if "visual_prompt" not in output_kwargs["text_kwargs"]:
    warnings.warn(
        "No `visual_prompt` kwarg was detected. The use of `visual_prompt` as an argument without specifying it explicitly as `visual_prompt=` will be deprecated in future versions."
    )
# For backwards compatibility, we reuse `audio` as `visual_prompt` in case
# downstream users passed it as a positional argument
if audio is not None:
    output_kwargs["text_kwargs"]["visual_prompt"] = audio

visual_prompt = output_kwargs["text_kwargs"].pop("visual_prompt", None)
molbap (Contributor):

Same remark as in #32841: we should not be using kwargs for a purpose other than the one advertised!

leloykun (author):

@molbap, I just moved the models with special arg names to #32841 so they'd be in one place.

This PR should now be ready for review.

leloykun marked this pull request as draft on August 16, 2024 09:44
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from 8a2a2c8 to 0e8e99e on August 16, 2024 14:16
leloykun changed the title from "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, VideoLLaVa" to "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, Siglip, VideoLLaVa" on Aug 16, 2024
leloykun marked this pull request as ready for review on August 16, 2024 16:17
…lipvideo, llava_next, llava_next_video, siglip, video_llava, vilt
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from 1380596 to ce9cc73 on August 17, 2024 06:19
leloykun changed the title from "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, Siglip, VideoLLaVa" to "Uniformize the kwargs for processors of AltClip, InstructBlipVideo, Flava, LLaVa-NeXT-Video, Siglip, VideoLLaVa, VILT" on Aug 17, 2024
leloykun changed the title from "Uniformize the kwargs for processors of AltClip, InstructBlipVideo, Flava, LLaVa-NeXT-Video, Siglip, VideoLLaVa, VILT" to "Uniformize model processors (models w/o special arg names)" on Aug 17, 2024
leloykun requested a review from molbap on August 17, 2024 06:43
zucchini-nlp (Member) left a comment:

Thanks for this awesome work! I left some general comments, mostly nits about cleaning up docstrings and clarification questions for my own understanding.

Additionally, we have to return BatchFeature and not BatchEncoding in the processor's __call__, and Chameleon needs to be removed from this PR, as it now has its own PR.
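
For reference, a minimal sketch of the requested return type; the helper name and input dicts are placeholders:

from transformers import BatchFeature


def _build_outputs(text_inputs: dict, image_inputs: dict, return_tensors=None) -> BatchFeature:
    # BatchFeature, not BatchEncoding, is the expected return type of a processor's __call__
    return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)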

Comment on lines 96 to 99
    images: Optional[VideoInput] = None,
    text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
    audio=None,
    videos=None,
zucchini-nlp (Member):

Hmm, this is quite interesting, since we accept videos as an argument but still annotate it as images: VideoInput.

OK, let's leave it for BC. I need to spend some time making the new video models more standard; currently some work with images and some with videos. But that won't be soon, I guess.

leloykun (author):

@zucchini-nlp I've made these changes (sketched below):

  • we now allow the user to use videos instead of images
  • using images now results in a future warning
  • using both videos and images now results in an error
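
A minimal sketch of the changes listed above, assuming an InstructBlipVideo-style __call__; the exact messages are illustrative:

import warnings


def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs):
    if images is not None and videos is not None:
        # using both videos and images is an error
        raise ValueError("Pass the clips only via `videos=...`, not via both `images` and `videos`.")
    if images is not None:
        # using images still works, but warns about the upcoming removal
        warnings.warn(
            "Passing videos through `images` is deprecated and will be removed in a "
            "future version; pass them as `videos=...` instead.",
            FutureWarning,
        )
        videos = images
    # ... proceed with `videos` from here on ...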

Comment on lines +17 to +21
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
zucchini-nlp (Member):

It's not very clear why we had to override this. Is it because we need to work only with fast tokenizers?

If that's the case: there should usually be no difference in how text is encoded with fast vs. slow tokenizers, so we should rather dig into why there's a difference.

leloykun (author):

I'll make the Chameleon-related changes in #32181.

After that gets merged, I'll rebase this onto main.

tests/test_processing_common.py (outdated; resolved)
tests/test_processing_common.py (resolved)
leloykun (author):

@zucchini-nlp I just added 4 more models:

  • Git
  • MGP
  • TVP
  • X-CLIP

zucchini-nlp (Member) left a comment:

Perfect, thanks for adding them! I left some general questions, but I think we should first focus on merging a smaller PR, to decide how to handle the edge cases we've been discussing.

src/transformers/models/mgp_str/processing_mgp_str.py (outdated; resolved)
src/transformers/models/mgp_str/processing_mgp_str.py (outdated; resolved)
src/transformers/models/tvp/image_processing_tvp.py (outdated; resolved)
-            return_token_type_ids=False,
-            **kwargs,
-        )
+        textual_input = self.tokenizer(text, **output_kwargs["text_kwargs"])
zucchini-nlp (Member):

This actually doesn't match the removed lines. Previously we had some kwargs hardcoded, like truncation or padding, and now users can set those to any value.

TBH I don't know why these were hardcoded, but I'd leave them as is so as not to break anything. In that case we don't need the pad_to_max_length kwarg, as the text is going to be padded to max length no matter what.

leloykun (author):

I moved them to the default values in TvpProcessorKwargs (sketched below). I think that should suffice, since this part would have errored out anyway if downstream users also passed them as kwargs (duplicate args error).
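
Roughly, that move looks like the sketch below; the values mirror the previously hardcoded tokenizer call and should be treated as illustrative:

from transformers.processing_utils import ProcessingKwargs


class TvpProcessorKwargs(ProcessingKwargs, total=False):
    _defaults = {
        "text_kwargs": {
            # previously hardcoded in the tokenizer call
            "padding": "max_length",
            "truncation": True,
            "return_token_type_ids": False,
        },
    }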

Comment on lines 127 to 133
      return_tensors = output_kwargs["common_kwargs"].get("return_tensors")
      if text is not None and videos is not None:
-         encoding["pixel_values"] = image_features.pixel_values
-         return encoding
+         return BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
      elif text is not None:
-         return encoding
+         return BatchFeature(data=dict(**encoding), tensor_type=return_tensors)
      else:
-         return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
+         return BatchFeature(data=dict(**image_features), tensor_type=return_tensors)
zucchini-nlp (Member):

We can simplify this and other processors by assigning empty defaults up front, filling them in only when the corresponding input is present, and returning a single BatchFeature:

encoding, image_features = {}, {}
# process inputs if present here

return BatchFeature(data={**encoding, **image_features}, tensor_type=return_tensors)

leloykun (author):

@zucchini-nlp I've implemented it slightly differently, but it's now a lot cleaner than before

@yonigozlan @molbap you might want to take a look too.

Comment on lines +31 to +32
"videos" not in inspect.signature(self.processor_class.__call__).parameters
or inspect.signature(self.processor_class.__call__).parameters["videos"].annotation == inspect._empty
zucchini-nlp (Member):

Is this because of InstructBlipVideo? If it's only that, we'd better override this test for InstructBlipVideo and keep the check as `"video_processor" in classes`.

When overriding for TVP, there's no need to check, as we know whether the model has a video processor or not. Actually, it seems like it's going to be skipped, because TVP only has an image_processor.

leloykun (author):

This is for:

  • InstructBlipVideo
  • TVP
  • VideoLlava, and
  • X-CLIP

Only Llava-Next-Video actually has a video_processor_class

leloykun (author):

I'm leaning toward leaving this here, because it covers more models than the original check and it significantly reduces the line count.
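
For reference, a hedged sketch of how the check can gate a test in the mixin; the helper name and skip message are illustrative:

import inspect


def _maybe_skip_video_tests(self):
    # Skip video tests for processors whose __call__ has no annotated `videos`
    # parameter, mirroring the signature check above
    sig = inspect.signature(self.processor_class.__call__)
    if "videos" not in sig.parameters or sig.parameters["videos"].annotation == inspect._empty:
        self.skipTest(f"{self.processor_class.__name__} does not take typed video inputs")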

tests/test_processing_common.py (outdated; resolved)
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from a44a76a to 9e00f68 on August 20, 2024 10:21
class TvpProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    from_pretrained_id = "Jiqing/tiny-random-tvp"
    processor_class = TvpProcessor
    videos_data_arg_name = "pixel_values"
leloykun (author):

@zucchini-nlp just a heads-up, because I remember you mentioning that you plan to rework the text-video models:

The output keys for video data are inconsistent; some processors use pixel_values_videos while others just use pixel_values. You'd most likely need to standardize them too.

zucchini-nlp (Member):

Yes, it's me who started using pixel_values_videos, hehe. Thanks for letting me know :)
